Named entity disambiguation
Formalization

A mapping \Gamma: M \rightarrow E, where M = \{m_1, m_2, ..., m_k\} is the set of name mentions, each characterized by its name m.S, context m.C and document m.D, and E = \{e_1, e_2, ..., e_n\} is the set of entities in a knowledge base [Han, X., Sun, L., & Zhao, J. (2011, July). Collective entity linking in web text: a graph-based method. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 765-774). ACM.].

Approaches

Individual

This approach assumes that different mentions are disambiguated independently of one another. It solves the optimization problem:

\Gamma^*_{local} = \arg \max_{\Gamma} \sum_{i=1}^{k} \phi(m_i, \gamma_i)

where \gamma_i = \Gamma(m_i) and \phi(m_i, \gamma_i) measures the compatibility between the mention m_i and the entity chosen for it by \Gamma. Features used in this approach may come from the surrounding words, sentence or paragraph (e.g. DBpedia Spotlight).

* Bag of Words (BoW), cosine similarity: Mihalcea & Csomai [Mihalcea, R. & Csomai, A. 2007. Wikify!: linking documents to encyclopedic knowledge. In: Proceedings of the sixteenth ACM CIKM.]
* BoW + Categories: Cucerzan (2007) [Cucerzan, S. 2007. Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of EMNLP-CoNLL.], Bunescu & Pasca [Bunescu, R. & Pasca, M. 2006. Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of EACL, vol. 6.], Fader et al. [Fader, A., Soderland, S., Etzioni, O. & Center, T. 2009. Scaling Wikipedia-based named entity disambiguation to arbitrary web text. In: Proceedings of Wiki-AI at IJCAI.]

Algorithms employed include:

* Classification: Zhang et al. [Zhang, W., Su, J., Tan, Chew Lim & Wang, W. T. 2010. Entity Linking Leveraging Automatically Generated Annotation. In: Proceedings of the 23rd COLING.] and Mihalcea & Csomai.
* Learning-to-rank techniques: Zheng et al. [Zheng, Z., Li, F., Huang, M. & Zhu, X. 2010. Learning to Link Entities with Knowledge Base. In: Proceedings of NAACL.], Dredze et al. [Dredze, M., McNamee, P., Rao, D., Gerber, A. & Finin, T. 2010. Entity Disambiguation for Knowledge Base Population. In: Proceedings of COLING.] and Zhou et al. [Zhou, Y., Nie, L., Rouhani-Kalleh, O., Vasile, F. & Gaffney, S. 2010. Resolving Surface Forms to Wikipedia Topics. In: Proceedings of the 23rd COLING.]

TODO: a survey: http://www.jair.org/media/4129/live-4129-7870-jair.pdf

Relational heuristics

The idea is that the referent entity of a name mention should be coherent with the unambiguous entities in its context.

* Medelyan et al. [Medelyan, O., Witten, I. H. & Milne, D. 2008. Topic indexing with Wikipedia. In: Proceedings of the AAAI WikiAI workshop.] determined the compatibility using the semantic relatedness between the candidate entity and the contextual entities.
* Milne and Witten [Milne, D. & Witten, I. H. 2008. Learning to link with Wikipedia. In: Proceedings of the 17th ACM CIKM.] extended the method of Medelyan et al. by adopting learning-based techniques to balance the semantic relatedness, the commonness and the context quality.

Collective

Some algorithms disambiguate based on collective topic coherence only, for example Babelfy [Moro, A., Raganato, A., & Navigli, R. (2014). Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the Association for Computational Linguistics, 2, 231–244. Retrieved from http://www.transacl.org/wp-content/uploads/2014/05/54.pdf].

Individual+Collective

These approaches combine local information with global inference to achieve better results. Given a coherence function \psi that assigns a score to each joint assignment of entities to all mentions of a document, the optimization goal becomes:

\Gamma^* = \arg \max_{\Gamma} \left[\sum_{i=1}^{k} \phi(m_i, \gamma_i)\right] + \psi(\Gamma)

For some choices of the coherence function, the problem has been proved to be NP-hard [Kulkarni, S., Singh, A., Ramakrishnan, G. & Chakrabarti, S. 2009. Collective annotation of Wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD.].

Some references:

* Pair-wise interdependence: Kulkarni et al.
* Directly connected entities: Han et al., Hoffart et al. (2011) [Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., ... & Weikum, G. (2011, July). Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 782-792). Association for Computational Linguistics.]
* All semantically related entities: Guo & Barbosa (2014) [Guo, Z., & Barbosa, D. (2014, April). Entity linking with a unified semantic representation. In Proceedings of the companion publication of the 23rd international conference on World wide web companion (pp. 1305-1310). International World Wide Web Conferences Steering Committee.]

Reranking

* Language model: Dalvi et al. (2014) [Dalvi, B., Xiong, C., & Callan, J. (2014, July). A language modeling approach to entity recognition and disambiguation for search queries. In Proceedings of the first international workshop on Entity recognition & disambiguation (pp. 45-54). ACM.]

Features

Local features

Probability of context

This feature encodes the context of a named entity as P(c|e), where c is the context of the named entity e. For a given context, a higher probability is assigned to the named entity that frequently appears with that context. Han and Sun (2011) [Xianpei Han and Le Sun. A generative entity-mention model for linking entities with knowledge base. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 945–954. Association for Computational Linguistics, 2011.] proposed an entity context model that estimates the distribution P(c|e) by encoding the context of an entity e in a unigram language model.
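As a rough illustration of a unigram entity context model of this kind, the sketch below estimates each term probability P_e(t) as the relative frequency of t among all context terms observed for entity e, and scores a context c as the product of its term probabilities. The function names, the entity identifiers and the toy contexts are hypothetical, and the sketch applies no smoothing, which is an assumption made for brevity rather than a description of Han and Sun's implementation.

```python
from collections import Counter
from math import prod


def train_entity_model(contexts):
    """Estimate P_e(t) as the relative frequency of term t over all
    context terms observed for one entity (unsmoothed, by assumption)."""
    counts = Counter(term for context in contexts for term in context)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def context_probability(model, context):
    """P(c|e) as the product of unigram term probabilities.
    A term never seen with the entity contributes probability 0.0."""
    return prod(model.get(term, 0.0) for term in context)


# Hypothetical, already tokenized toy contexts; not real corpus data.
training_contexts = {
    "Michael_Jordan_(basketball)": [
        ["nba", "bulls", "player"],
        ["bulls", "nba", "champion"],
    ],
    "Michael_I._Jordan": [
        ["machine", "learning", "professor"],
        ["berkeley", "learning", "statistics"],
    ],
}

models = {
    entity: train_entity_model(contexts)
    for entity, contexts in training_contexts.items()
}


def best_entity(context):
    """Return the candidate entity whose model gives the context the highest probability."""
    return max(models, key=lambda entity: context_probability(models[entity], context))
```

Because the estimate is unsmoothed, a single context term that was never observed with an entity drives P(c|e) to zero for that entity; a practical system would typically smooth the term probabilities and work with log-probabilities to avoid numerical underflow on long contexts.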
Han and Sun define the context as the surrounding window of 50 terms and compute the entity context probability as:

P(c|e) = P_e(t_1) P_e(t_2) ... P_e(t_n)

where t_1, ..., t_n are the terms of the context c, P_e(t) = \frac{Count_e(t)}{\sum_{t'} Count_e(t')}, and Count_e(t) is the frequency of occurrence of term t in the context of the named entity e.

Evaluation

Datasets

* Wikipedia articles: traditionally used, but Kulkarni et al. (2009) pointed out that they are unsuitable for evaluating high-recall entity linking tasks because they annotate name mentions very sparsely (only the important name mentions are annotated).
* TAC 2009 [McNamee, P. & Dang, H. T. 2009. Overview of the TAC 2009 Knowledge Base Population Track. In: Proceedings of the Text Analysis Conference.]: focuses on individual EL tasks in different documents, which makes it unsuitable for collective EL settings.
* IITB dataset: a set of 107 documents collected from web sites belonging to a handful of domains. For each document, the referent entities in Wikipedia of its name mentions are manually annotated as exhaustively as possible. In total, 17,200 name mentions are annotated, 161 per document on average.
* KORE50 (Hoffart et al., 2012) [Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: keyphrase overlap relatedness for entity disambiguation. In Proc. of CIKM, pages 545–554.]: consists of 50 short English sentences (mean length of 14 words) with a total of 144 mentions manually annotated using YAGO2, for which a Wikipedia mapping is available. This dataset was built to test the EL task against a high level of ambiguity.
* AIDA-CoNLL (Hoffart et al., 2011) [Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proc. of EMNLP, pages 782–792.]: consists of 1392 English articles with a total of roughly 35K named entity mentions annotated with YAGO concepts, split into development, training and test sets.

Evaluation criteria

* F-score

Applications

* Information extraction
* Knowledge base population
* Semantic search

See also

* Resources for entity linking

Category:Entity linking